Conversation
Thank you for the fix! LGTM
Thanks for working on this! I put a little more thought into it. All we need is to check whether func throws (try/catch around func for dmlc::Error) and save the exception. Then check if the exception_ptr is set in PushSync and, if so, rethrow the exception inside PushSync. This should handle both 1 and 2 mentioned here: #14522
EDIT: The approach I mentioned will still only work for 1. We will hit situation 2 when there is no sync call in the forward or backward. I think custom op wasn't designed to handle this use case and it is tricky to support it now. I looked at all examples, tests and also customer code like sockeye, and most of these have some sync call at the end, like self.assign assigning the output ndarrays. We should make this limitation explicit in the CustomOp documentation for now.
src/operator/custom/custom-inl.h
Outdated
fn();
try {
  task.fn();
} catch (dmlc::Error& e) {
What about other types of exceptions? Could we add a comment in the code clarifying this?
What other types? I think the only valid exception is dmlc::Error; any other exception means the code is wrong.
@@ -5237,6 +5238,17 @@ def custom_add():
p.join(5)
assert not p.is_alive(), "deadlock may exist in custom operator"

# test exception handling
# see https://github.com/apache/incubator-mxnet/pull/14575
def custom_add_exc():
Can we add a comment noting that an exception is expected here, due to the shapes I assume?
Sure, I'll add it.
@mxnet-label-bot Add [Backend, Operator]
@anirudh2290 I've simplified the exception catching according to your suggestion. Agreed that situation 2 is really tricky, because for now …
src/engine/threaded_engine.cc
Outdated
BulkFlush();
ThreadedVar* threaded_var = ThreadedVar::CastFromBase(var);
if (threaded_var->ready_to_read()) {
  ThrowException(threaded_var);
  CustomOperator::Get()->ThrowException();
Let's remove this. ThreadedEngine should not depend on custom operator code.
std::make_shared<std::exception_ptr>(std::current_exception());
ctx.async_on_complete();
return;
}
I think we can solve both 1 and 2 this way: after func is called, do wait_to_read on all elements in arrs. Then catch and save the exception. Remove lines 104 and 105. In PushSync, check if the exception is set and rethrow it; also catch it, call async_on_complete in PushSync, and return.
Something like the following:
Engine::Get()->PushSync(
    [=](RunContext rctx) {
      try {
        if (exception_) {
          std::rethrow_exception(exception_);
        }
      } catch (dmlc::Error& err) {
        ctx.async_on_complete(&err);
        return;
      }
    }
Thanks to the support added for horovod in #13932, we may be able to leverage it to call async_on_complete with the error.
Adding wait_to_read in the custom op can solve both 1 and 2, and then the op can be treated as a normal op without using ExecType::kAsync.
We probably still need PushSync for the sparse ndarray updates.
We still need ExecType::kAsync. The custom operator is still async: when push is called, it just pushes the work into the custom op worker queue for later execution. Async ensures that threaded_engine_pooled and threaded_engine_per_device treat it as a special case and execute it immediately instead of pushing the work again to one of the engine worker thread queues. Pushing to an engine worker thread queue is unnecessary for custom op.
After testing, ExecType::kAsync is really needed: adding wait_to_read in an engine worker thread will cause a deadlock. But PushSync can be removed and it works well.
We probably still need it for sparse. Since for sparse we are updating the chunk, it is a write operation, so WaitToRead may not be enough.
Yes, I also added WaitToWrite to make sure there are no left-out exceptions.
Overall looks good. Can you run example/reinforcement-learning/dqn, which includes a custom op, on CPU to check performance and make sure there is no perf impact?
@anirudh2290 I'll test the example you've mentioned. Let me also mention another problem here:
Run the code with:
Even if you have n threads, we can still deadlock the execution with an n+1 custom op dependency chain: A->B->C... We should make sure there are always enough threads to use; maybe remove MXNET_CUSTOM_OP_NUM_THREADS and create a new thread when needed?
@arcadiaphy The custom operator already creates a new thread when needed, and MXNET_CUSTOM_OP_NUM_THREADS is the maximum number of threads. Removing this line https://github.com/apache/incubator-mxnet/blob/master/src/operator/custom/custom-inl.h#L185 would support an unlimited number of threads.
@arcadiaphy Yes, good point, the code you linked will cause a deadlock. This is also true if the custom op forward or backward has asnumpy() or wait_to_read at the end, even before applying your change, when you set MXNET_CUSTOM_OP_NUM_THREADS=1. Even before the custom op multi-threaded change this was the case. Also, I am concerned about performance because of the blocking call during op execution. I think we have reached a point where there is no option but to change the approach and special-case "CustomOperator" in ExecuteOprBlock, by checking whether the op is a CustomOperator and, if so, executing the operator instead of skipping it (https://github.com/apache/incubator-mxnet/blob/master/src/engine/threaded_engine.h#L377). I was trying to avoid this to keep the engine agnostic of the op it is executing, but this seems like the least impactful solution currently.
@anirudh2290 If the engine code remains unchanged, I think enabling an unlimited number of custom op threads when needed is a quick fix.
Also, with unlimited threads there should theoretically be no performance impact.
I think we should make the modification in the engine. Keeping that env variable is fine, as users may want to limit the custom op threads so that customers don't run into a situation where there are too many custom ops in their model, lots of threads are spawned, and the thread-spawning overhead is high. Also, performance-wise, adding the WaitToRead and WaitToWrite can hurt performance: every thread now pushes all ops, waits for them to complete, and then exits, compared to the earlier custom op threads, which would just push the op to the engine and exit. Custom op threads were lightweight earlier and we should keep them that way.
@anirudh2290 How about leaving this PR aside temporarily and creating a new PR to modify the engine code?
Yes, I am fine with that.
Closed, as the issue was fixed in #14693.
Description
This PR fixes Problem 1 in #14522: it stores the exception thrown by a custom op and re-throws it in the sync function.
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
Comments